Abstract

Counting the number of distinct elements (cardinality) in a dataset is a
fundamental problem in database management. In recent years, due to many
of its modern applications, there has been significant interest in addressing
the distinct counting problem in a data stream setting, where each incoming data
item can be seen only once and cannot be stored for long periods of time. Many
probabilistic approaches based on either sampling or sketching have been
proposed in the computer science literature that require only limited computing
and memory resources. However, the performance of these methods is not
scale-invariant, in the sense that their relative root mean square estimation
errors (RRMSE) depend on the unknown cardinalities. This is undesirable
in many applications where cardinalities can be very dynamic or inhomogeneous
and many cardinalities need to be estimated. In this paper, we develop a
novel approach, called self-learning bitmap (S-bitmap), that is scale-invariant
for cardinalities in a specified range. S-bitmap uses a binary vector whose
entries are updated from 0 to 1 by an adaptive sampling process for inferring
the unknown cardinality, where the sampling rates are reduced sequentially as
more and more entries change from 0 to 1. We prove rigorously that the
S-bitmap estimate is not only unbiased but also scale-invariant. We demonstrate
that, to achieve a small RRMSE value of $\epsilon$ or less, our approach requires
significantly less memory and uses a similar or smaller number of operations than
state-of-the-art methods for many cardinality scales common in practice. Both
simulation and experimental studies are reported.

Keywords: Distinct counting, sampling, streaming data, bitmap, Markov 
chain, martingale. 

1 Introduction 

Counting the number of distinct elements (cardinality) in a dataset is a fundamental
problem in database management. In recent years, due to high rate data collection
in many modern applications, there has been significant interest in addressing the
distinct counting problem in a data stream setting where each incoming data item can
be seen only once and cannot be stored for long periods of time. Algorithms to deal
with streaming data are often called online algorithms. For example, in modern
high speed networks, data traffic in the form of packets can arrive at the network
link at speeds of gigabits per second, creating a massive data stream. A sequence
of packets between the same pair of source and destination hosts and their
application protocols forms a flow, and the number of distinct network flows is an
important monitoring metric for network health (for example, the early stage of a
worm attack often results in a significant increase in the number of network flows
as infected machines randomly scan others, see Bu et al. (2006)). As another
example, it is often useful to monitor connectivity patterns among network hosts and
count the number of distinct peers that each host is communicating with over time
(Karasaridis et al., 2007), in order to analyze the presence of peer-to-peer
networks that are used for file sharing (e.g. songs, movies).

The challenge of distinct counting in the stream setting is due to the constraint
of limited memory and computation resources. In this scenario, the exact solution is
infeasible, and a lightweight algorithm that derives an approximate count with low
memory and computational cost but with high accuracy is desired. In particular,
such a solution will be much preferred for counting tasks performed over
Android-based smart phones (with only limited memory and computing resources), which
are in rapid growth nowadays (Menten et al., 2011). Another difficulty is that in
many applications, the unknown cardinalities to be estimated may fall into a wide
range, from 1 to $N$, where $N \gg 1$ is a known upper bound. Hence an algorithm
that can perform uniformly well within the range is preferred. For instance, there
can be millions of hosts (e.g. home users) active in a network and the number of
flows each host has may change dramatically from host to host and from time to
time. Similarly, a core network may be composed of many links with varying link
speeds, and a traffic snapshot of the network can reveal variations between links
by several orders of magnitude. (A real data example is given in Section 7.) It is
problematic if the algorithm for counting the number of flows works well (e.g.
relative root mean square estimation errors are below some threshold) on some links
while not on others due to different scales.

There have been many solutions developed in the computer science literature to
address the distinct counting problem in the stream setting, most notably
Flajolet and Martin (1985), Whang et al. (1990), Gibbons (2001), Durand and
Flajolet (2003), Estan et al. (2006), Flajolet et al. (2007), among others. Various
asymptotic analyses have been carried out recently, see Kane et al. (2010) and
references therein. The key idea is to obtain a statistical estimate by designing a
compact and easy-to-compute summary statistic (also called a sketch in computer
science) from the streaming data. Some of these methods (e.g. LogLog counting by
Durand and Flajolet (2003) and Hyper-LogLog counting by Flajolet et al. (2007)) have
nice statistical properties such as asymptotic unbiasedness. However, the
performance of these existing solutions often depends on the unknown cardinalities,
and they cannot perform uniformly well in the targeted range of cardinalities
$[1, N]$. For example, with limited memory, linear counting proposed by
Whang et al. (1990) works best with small cardinalities while the LogLog counting
method works best with large cardinalities.

Let the performance of a distinct counting method be measured by its relative
root mean square error (RRMSE), where RRMSE is defined by

$$\mathrm{Re}(\hat{n}) = \sqrt{E(n^{-1}\hat{n} - 1)^2},$$

where $n$ is the distinct count parameter and $\hat{n}$ is its estimate. In this
article we develop a novel statistics based distinct counting algorithm, called
S-bitmap, that is scale-invariant, in the sense that RRMSE is invariant to the
unknown cardinalities in a wide range without additional memory and computational
costs, i.e. there exists a constant $\epsilon > 0$ such that

$$\mathrm{Re}(\hat{n}) = \epsilon, \quad \text{for } n = 1, \ldots, N. \qquad (1)$$

S-bitmap uses the bitmap, i.e., a binary vector, to summarize the data for
approximate counting, where the binary entries are changed from 0 to 1 by an
adaptive sampling process. In the spirit of Morris (1978), the sampling rates
decrease sequentially as more entries change to 1, with the optimal rate learned
from the current state of the bitmap. The cardinality estimate is then obtained by
using a non-stationary Markov chain model derived from S-bitmap. We use martingale
properties to prove that our S-bitmap estimate is unbiased, and more importantly,
its RRMSE is indeed scale-invariant. Both simulation and experimental studies are
reported. To achieve the same accuracy as state-of-the-art methods, S-bitmap
requires significantly less memory for many cardinality scales common in practice,
with similar or lower computational cost.

The distinct counting problem we consider here is weakly related to the
traditional 'estimating the number of species' problem, see Bunge and Fitzpatrick
(1993), Haas and Stokes, Mao (2006) and references therein. However,
traditional solutions that rely on sample sets of the population are impractical in
the streaming context due to restrictive memory and computational constraints. While
traditional statistical studies (see Bickel and Doksum, 2001) mostly focus on
statistical inference given a measurement model, a critical new component of the
solution in the online setting, as we study in this paper, is that one has to design
much more compact summary statistics from the data (equivalent to a model), which
can be computed online.

The remainder of the paper is organized as follows. Section 2 further elaborates
the background and reviews several competing online algorithms from the literature.
Sections 3 and 4 describe S-bitmap and estimation. Section 5 provides the
dimensioning rule for S-bitmap and analysis. Section 6 reports simulation studies
including both performance evaluation and comparison with state-of-the-art
algorithms. Experimental studies are reported in Section 7. Throughout the paper,
$P$ and $E$ denote probability and expectation, respectively, $\ln(x)$ and
$\log(x)$ denote the natural logarithm and base-2 logarithm of $x$, and Table 1
lists most notations used in the paper.

The S-bitmap algorithm has been successfully implemented in some Alcatel-Lucent
network monitoring products. A 4-page poster about the basic idea of S-bitmap (see
Chen and Cao, 2009) was presented at the International Conference on Data
Engineering in 2009.

2 Background 

In this section, we provide some background and review in detail a few classes
of benchmark online distinct counting algorithms from the existing literature that
only require limited memory and computation. Readers familiar with the area can
skip this section.

2.1 Overview 

Let $X = \{x_1, x_2, \ldots, x_T\}$ be a sequence of items with possible
replicates, where $x_i$ can be numbers, texts, images or other digital symbols. The
problem of distinct counting is to estimate the number of distinct items from the
sequence, denoted as $n = |\{x_i : 1 \le i \le T\}|$. For example, if $x_i$ is the
$i$-th word in a book, then $n$ is the number of unique words in the book. It is
obvious that an exact solution can be obtained by listing all distinct items (e.g.
words in the example). However, as we can easily see, this solution quickly becomes
less attractive when $n$ becomes large as it requires a memory linear in $n$ for
storing the list, and an order of $\log n$ item comparisons for checking the
membership of an item in the list.

Variable                  Meaning

$m$                       memory requirement in bits
$n$                       cardinality to be estimated
$\hat{n}$                 S-bitmap estimate of $n$
$P$, $E$, var             probability, expectation, variance
$\mathrm{Re}(\hat{n})$    $\sqrt{E(\hat{n} n^{-1} - 1)^2}$ (relative root mean square error)
$[0, N]$                  the range of cardinalities to be estimated
$C^{-1/2}$, $\epsilon$    (expected, theoretic) relative root mean square error of S-bitmap
$V$                       a bitmap vector
$p_b$                     sequential sampling rate ($1 \le b \le m$)
$S_t$                     bucket location in $V$
$L_t$                     number of 1s in $V$ after the $t$-th distinct item is hashed into $V$
$I_t$                     indicator whether the $t$-th distinct item fills in an empty bucket in $V$
$C_t$                     the set of locations of buckets filled with 1s in $V$
$T_b$                     number of distinct items after $b$ buckets are filled with 1s in $V$
$t_b$                     expectation of $T_b$

Table 1: Some notations used in the paper.

The objective of online algorithms is to process the incoming data stream in real
time where each data item can be seen only once, and derive an approximate count
with accuracy guarantees but with a limited storage and computation budget. A
typical online algorithm consists of the following two steps. First, instead of
storing the original data, one designs a compact sketch such that the essential
information about the unknown quantity (cardinality in this case) is kept. The
second step is an inference step where the unknown quantity is treated as the
parameter of interest, and the sketch is modeled as random variables (functions)
associated with the parameter. In the following, we first review a class of bitmap
algorithms including linear counting by Whang et al. (1990) and multi-resolution
bitmap (mr-bitmap) by Estan et al. (2006), which are closely related to our new
approach. Then we describe another class of Flajolet-Martin type algorithms. We also
briefly cover other methods, such as sampling, that do not follow exactly the above
online sketching framework. An excellent review of these and other existing methods
can be found in Beyer et al. (2009), Metwally et al. (2008), Gibbons (2009), and in
particular, Metwally et al. (2008) provides extensive simulation comparisons. Our
new approach will be compared with three state-of-the-art algorithms from the first
two classes of methods: mr-bitmap, LogLog counting and Hyper-LogLog counting.

2.2 Bitmap 



The bitmap scheme for distinct counting was first proposed in Astrahan et al.
(1987) and then analyzed in detail in Whang et al. (1990). To estimate the
cardinality of the sequence, the basic idea of bitmap is to first map the $n$
distinct items uniformly randomly to $m$ buckets such that replicate items are
mapped to the same bucket, and then estimate the cardinality based on the number of
non-empty buckets. Here the uniform random mapping is achieved using a universal
hash function (see Knuth, 1998), which is essentially a pseudo uniform random
number generator that takes a variable-size input, called 'key' (i.e. seed), and
returns an integer distributed uniformly in the range of $[1, m]$.^1 For
convenience, let $h : \mathcal{X} \to \{1, \ldots, m\}$ be a universal hash
function, which takes a key $x \in \mathcal{X}$ and maps it to a hash value $h(x)$.
For theoretical analysis, we assume that the hash function distributes the items
randomly, e.g. for any $x, y \in \mathcal{X}$ with $x \ne y$, $h(x)$ and $h(y)$ can
be treated as two independent uniform random numbers. A bitmap of length $m$ is
simply a binary vector, say $V = (V[1], \ldots, V[k], \ldots, V[m])$ where each
element $V[k] \in \{0, 1\}$.

^1 As an example, by taking the input datum $x$ as an integer, the Carter-Wegman
hash function is as follows: $h(x) = ((ax + b) \bmod p) \bmod m$, where $p$ is a
large prime, and $a$, $b$ are two arbitrarily chosen integers modulo $p$ with
$a \ne 0$. Here $x$ is the key and the output is an integer in $\{1, \ldots, m\}$
if we replace 0 with $m$.

The basic bitmap algorithm for online distinct counting is as follows. First,
initialize $V[k] = 0$ for $k = 1, \ldots, m$. Then for each incoming data item
$x \in \mathcal{X}$, compute its hash value $k = h(x)$ and update the corresponding
entry in the bitmap by setting $V[k] = 1$. For convenience, this is summarized in
Algorithm 1. Notice that the bitmap algorithm requires a storage of $m$ bits
attributed to the bitmap and requires no additional storage for the data. It is easy
to show that each entry $V[k]$ in the bitmap is
Bernoulli$(1 - (1 - m^{-1})^n)$, and hence the distribution of
$|V| = \sum_{k=1}^{m} V[k]$ only depends on $n$. Various estimates of $n$ have been
developed based on $|V|$; for example, linear counting as mentioned above uses the
estimator $m \ln(m (m - |V|)^{-1})$. The name 'linear counting' comes from the fact
that its memory requirement is almost linear in $n$ in order to obtain good
estimation.

Algorithm 1 Basic bitmap

Input: a stream of items x
       V (a bitmap vector of zeros with size m)
Output: |V| (number of entries with 1s in V)
Configuration: m

for x in X do
    compute its hash value k = h(x)
    if V[k] = 0 then
        update V[k] = 1
Return |V| = sum_{k=1}^{m} V[k].

Typically, $N$ is much larger than the required memory $m$ (in bits), thus the
mapping from $\{0, \ldots, m\}$ to $\{1, \ldots, N\}$ cannot be one-to-one, i.e.
perfect estimation, but one-to-multiple. A bitmap of size $m$ can only be used to
estimate cardinalities less than $m \log m$ with certain accuracy. In order to make
it scalable to a larger cardinality scale, a few improved methods based on bitmap
have been developed (see Estan et al., 2006). One method, called virtual bitmap, is
to apply the bitmap scheme on a subset of items that is obtained by sampling
original items with a given rate $r$. Then an estimate of $n$ can be obtained by
estimating the cardinality of the sampled subset. But it is impossible for virtual
bitmap with a single $r$ to estimate a wide range of cardinalities accurately.
Estan et al. (2006) proposed a multiresolution bitmap (mr-bitmap) to improve virtual
bitmap. The basic idea of mr-bitmap is to make use of multiple virtual bitmaps, each
with a different sampling rate, and to embed them into one bitmap in a
memory-efficient way. To be precise, it first partitions the original bitmap into
$K$ blocks (equivalent to $K$ virtual bitmaps), and then associates buckets in the
$k$-th block with a sampling rate $r_k$ for screening distinct items. It may be
worth pointing out that mr-bitmap determines $K$ and the sampling rates with a
quasi-optimal strategy and it is still an open question how to optimize them, which
we leave for future study. Though there is no rigorous analysis in Estan et al.
(2006), mr-bitmap is not scale-invariant as suggested by simulations in Section 6.
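
As a rough illustration of the basic bitmap of Algorithm 1 combined with the linear counting estimator $m \ln(m (m - |V|)^{-1})$, the following Python sketch may help. It is only a minimal sketch under stated assumptions: SHA-256 stands in for the universal hash function, buckets are indexed from 0 to $m - 1$ rather than 1 to $m$, and the function names are illustrative and do not come from the paper.

```python
import hashlib
import math


def bucket_of(item: str, m: int) -> int:
    """Map an item to a bucket in {0, ..., m-1}.

    Assumption: SHA-256 is used as a stand-in for the universal hash h(x).
    """
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % m


def basic_bitmap(stream, m: int) -> int:
    """Basic bitmap update: return |V|, the number of non-empty buckets."""
    bitmap = [0] * m
    for item in stream:
        # Replicates hash to the same bucket, so they never change |V|.
        bitmap[bucket_of(item, m)] = 1
    return sum(bitmap)


def linear_counting_estimate(filled: int, m: int) -> float:
    """Linear counting estimator m * ln(m / (m - |V|))."""
    if filled >= m:
        # With every bucket full the logarithm is undefined.
        raise ValueError("bitmap is saturated; estimator is undefined")
    return m * math.log(m / (m - filled))
```

For example, `linear_counting_estimate(basic_bitmap(stream, 4096), 4096)` gives an approximate distinct count for `stream`; the estimate degrades as the bitmap approaches saturation, which is the scalability limit discussed above.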
2.3 Flajolet-Martin type algorithms



The approach of Flajolet and Martin (1985) (FM) has pioneered a different class of
algorithms. The basic idea of FM is to first map each item $x$ to a geometric random
number $g$, and then record the maximum value of the geometric random numbers
$\max(g)$, which can be updated sequentially. In the implementation of FM, upon
the arrival of an item $x$, the corresponding $g$ is the location of the left-most 1
in the binary vector $h(x)$ (each entry of the binary vector follows
Bernoulli$(1/2)$), where $h$ is a universal hash function mentioned earlier.
Therefore $P(g = k) = 2^{-k}$. Naturally by hashing, replicate items are mapped to
the same geometric random number. The maximum order statistic $\max(g)$ is the
summary statistic for FM, also called the FM sketch in the literature. Note that the
distribution of $\max(g)$ is completely determined by the number of distinct items.
By randomly partitioning items into $m$ groups, the FM approach obtains $m$ maximum
random numbers, one for each group, which are independent and identically
distributed, and then estimates the distinct count by a moment method. Since FM
makes use of the binary value of $h(x)$, which requires at most $\log(N)$ bits of
memory where $N$ is the upper bound of distinct counts (taken as a power of 2), it
is also called log-counting. Various extensions of the FM approach have been
explored in the literature based on the $k$-th maximum order statistic, where
$k = 1$ corresponds to FM (see Giroire, 2005; Beyer et al., 2009).

Flajolet and his collaborators have recently proposed two innovative methods,
called LogLog counting and Hyper-LogLog as mentioned above, published in 2003
and 2007, respectively. Both methods use the technique of recording the binary
value of $g$ directly, which requires at most $\log(\log N)$ bits (taking $N$ such
that $\log(\log N)$ is an integer), and therefore are also called loglog-counting.
This provides a more compact summary statistic than FM. Hyper-LogLog is built on a
more efficient estimator than LogLog, see Flajolet et al. (2007) for the exact
formulas of the estimators.

Simulations suggest that although Hyper-LogLog may have a bounded RRMSE
for cardinalities in a given range, its RRMSE fluctuates as cardinalities change and
thus it is not scale-invariant.
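
To make the FM summary statistic concrete, the following Python sketch computes the geometric value $g$ (the position of the left-most 1 in the hashed bits) and the running maximum for a single group. It is only an illustrative sketch: SHA-256 stands in for the universal hash, the 32-bit width and the convention of returning `width + 1` for an all-zero hash are assumptions, and the moment-based estimator built on top of the sketch is not reproduced here.

```python
import hashlib


def hashed_bits(item: str, width: int = 32) -> int:
    """Return the top `width` bits (width <= 64) of SHA-256(item) as an integer.

    Assumption: SHA-256 plays the role of the universal hash h(x).
    """
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") >> (64 - width)


def leftmost_one_position(value: int, width: int) -> int:
    """1-based position of the left-most 1 in the `width`-bit form of `value`.

    Assumed convention: an all-zero value yields width + 1.
    """
    if value == 0:
        return width + 1
    return width - value.bit_length() + 1


def fm_max_sketch(stream, width: int = 32) -> int:
    """Running maximum of g over the stream (a single FM group)."""
    best = 0
    for item in stream:
        g = leftmost_one_position(hashed_bits(item, width), width)
        best = max(best, g)  # replicates give the same g, so they change nothing
    return best
```

Because replicates of an item always produce the same $g$, the sketch depends only on the set of distinct items, which is the property the estimators in this family rely on.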



2.4 Distinct sampling 



The paper of Flajolet (1990) proposed a novel sampling algorithm, called Wegman's
adaptive sampling, which collects a random sample of the distinct elements (binary
values) of size no more than a pre-specified number. Upon arrival of a new
distinct element, if the sample size of the existing collection is more than a
threshold, the algorithm will remove some of the collected sample and the new
element will be inserted with a sampling rate $2^{-k}$, where $k$ starts from 0 and
grows adaptively according to available memory. The distinct sampling of Gibbons
(2001) uses the same idea to collect a random sample of distinct elements. These
sampling algorithms are essentially different from the above two classes of
algorithms based on one-scan sketches, and are computationally less attractive as
they require scanning all existing collection periodically. They belong to the
log-counting family with memory cost in the order of $\epsilon^{-2} \log(N)$ where
$\epsilon$ is an asymptotic RRMSE, but their asymptotic memory efficiency is
somewhat worse than the original FM method, see Flajolet et al. (2007) for an
asymptotic comparison. Flajolet (1990) has shown that with a finite population, the
RRMSE of Wegman's adaptive sampling exhibits periodic fluctuations, depending on
unknown cardinalities, and thus it is not scale-invariant as defined by (1). Our new
approach makes use of the general idea of adaptive sampling, but is quite different
from these sampling algorithms, as ours does not require collecting a sample set of
distinct values, and furthermore is scale-invariant as shown later.

3 Self-learning Bitmap 

As we have explained in Section 2.2, the basic bitmap (see Algorithm 1), as well as
virtual bitmap, provides a memory-efficient data summary but they cannot be used
to estimate cardinalities accurately in a wide range. In this section, we describe a
new approach for online distinct counting by building a self-learning bitmap
(S-bitmap for abbreviation), which not only is memory-efficient, but provides a
scale-invariant estimator with high accuracy.

The basic idea of S-bitmap is to build an adaptive sampling process into a bitmap
as our summary statistic, where the sampling rates decrease sequentially as more
and more new distinct items arrive. The motivation for decreasing sampling rates
is easy to perceive: if one draws a Bernoulli sample with rate $p$ from a population
with unknown size $n$ and obtains a Binomial count, say $X \sim$ Binomial$(n, p)$,
then the maximum likelihood estimate $p^{-1}X$ for $n$ has relative mean square
error $E(n^{-1}p^{-1}X - 1)^2 = (1 - p)/(np)$. So, to achieve a constant relative
error, one needs to use a smaller sampling rate $p$ on a larger population with size
$n$. The sampling idea is similar to "adaptive sampling" of Morris (1978), which was
proposed for counting a large number of items with no item-duplication using a small
memory space. However, since the main issue of distinct counting is
item-duplication, Morris' approach does not apply here.
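
The relative mean square error $(1 - p)/(np)$ quoted above follows in one line from the mean $np$ and variance $np(1 - p)$ of the Binomial distribution. The short derivation is written out here for convenience:

```latex
\begin{aligned}
E\left(n^{-1}p^{-1}X - 1\right)^2
  &= \frac{E\left(X - np\right)^2}{n^2 p^2}
   = \frac{\operatorname{var}(X)}{n^2 p^2}
   = \frac{np(1-p)}{n^2 p^2}
   = \frac{1-p}{np}.
\end{aligned}
```

Holding this quantity at a fixed level $\epsilon^2$ requires $p = 1/(1 + n\epsilon^2)$, which decreases as $n$ grows; this is the sense in which a larger population calls for a smaller sampling rate.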

Now we describe S-bitmap and show how it deals with the item-duplication
issue effectively. The basic algorithm for extracting the S-bitmap summary statistic
is as follows. Let $1 \ge p_1 \ge p_2 \ge \cdots \ge p_m > 0$ be specified sampling
rates. A bitmap vector $V \in \{0, 1\}^m$ with length $m$ is initialized with 0 and
a counter $L$ is initialized by 0 for the number of buckets filled with 1s. Upon the
arrival of a new item $x$ (treated as a string or binary vector), it is mapped, by a
universal hash function using $x$ as the key, to say $k \in \{1, \ldots, m\}$. If
$V[k] = 1$, then skip to the next item; otherwise, with probability $p_{L+1}$,
$V[k]$ is changed from 0 to 1, in which case $L$ is increased by 1. (See Figure 1
for an illustration.) Note that the sampling is also realized with a universal hash
function using $x$ as keys. Here, $L \in \{0, 1, \ldots, m\}$ indicates how many
1-bits there are by the end of the stream update. Obviously, the bigger $L$ is, the
larger the cardinality is expected to be. We show in Section 4 how to use $L$ to
characterize the distinct count.

If $m = 2^c$ for some integer $c$, then S-bitmap can be implemented efficiently as
follows. Let $d$ be an integer. Each item $x$ is mapped by a universal hash
function using $x$ as the key to a binary vector with length $c + d$. Let $j$ and
$u$ be two integers that correspond to the binary representations with the first $c$
bits and last $d$ bits, respectively. Then $j$ is the bucket location in the bitmap
that the item is hashed into, and $u$ is used for sampling. It is easy to see that
$j$ and $u$ are independent. If the bucket is empty, i.e. $V[j] = 0$, then check
whether $u 2^{-d} < p_{L+1}$ and if true, update $V[j] = 1$. If the bucket is not
empty, then just skip to the next item. This is summarized in Algorithm 2, where the
choice of $(p_1, \ldots, p_m)$ is described in Section 5. Here we follow the setting
of the LogLog counting paper by Durand and Flajolet (2003) and take
$\mathcal{X} = \{0, 1\}^{c+d}$. There is a chance of collision for hash functions.
Typically $d = 30$, which is small relative to $m$, is sufficient for $N$ in the
order of millions.

Since the sequential sampling rates $p_L$ only depend on $L$, which allows us to
learn the number of distinct items already passed, the algorithm is called
Self-learning bitmap (S-bitmap).^2 We note that the decreasing property of the
sampling rates, beyond the above heuristic optimality, is also sufficient and
necessary for filtering out all duplicated items. To see the sufficiency, just note
that if an item is not sampled in its first appearance, then the $d$-bit number
associated with it (say $u$, in line 5 of Algorithm 2) is larger than its current
sampling rate, say $p_L$. Thus its later replicates, still mapped to $u$, will not
be sampled either due to the monotone property. Mathematically, if the item is
mapped to $u$ with $u 2^{-d} > p_L$, then $u 2^{-d} > p_{L+1}$ since
$p_{L+1} \le p_L$. On the other hand, if $p_{L+1} > p_L$, then in line 7 of
Algorithm 2, $P(p_L < u 2^{-d} < p_{L+1}) > 0$, that is, there is a positive
probability that the item mapped to $u$, in its first appearance, is not sampled at
$L$, but its later replicate is sampled at $L + 1$, which establishes the necessity.
The argument of sufficiency here will be used to derive S-bitmap's Markov property
in Section 4.1, which leads to the S-bitmap estimate of the distinct count using
$L$.

^2 Statistically, the self learning process can also be called adaptive sampling. We
notice that Estan et al. (2006) have used 'adaptive bitmap' to stand for a virtual
bitmap where the sampling rate is chosen adaptively based on another rough
estimate, and that Flajolet (1990) has used 'adaptive sampling' for subset sampling.
To avoid potential confusion with these, we use the name 'self learning bitmap'
instead of 'adaptive sampling bitmap'.

[Figure 1 panels. Case 1: $x_i$ is mapped to a bucket with value 1. Case 2: $x_i$
is mapped to a bucket with value 0.]

Figure 1: Update of the bitmap vector: in case 1, just skip to the next item, and in
case 2, with probability $p_{L+1}$ where $L$ is the number of 1s in $V$ so far, the
bucket value is changed from 0 to 1.

Algorithm 2 S-bitmap (SKETCHING UPDATE)

Input: a stream of items x (hashed binary vector with size c + d)
       V (a bitmap vector of zeros with size m = 2^c)
Output: B (number of buckets with 1s in V)
Configuration: m

 1: Initialize L = 0
 2: for x = b_1 ... b_{c+d} in X do
 3:     set j := [b_1 ... b_c]_2 (integer value of first c bits in base 2)
 4:     if V[j] = 0 then
 5:         u = [b_{c+1} ... b_{c+d}]_2
 6:         # sampling #
 7:         if u 2^{-d} < p_{L+1} then
 8:             V[j] = 1
 9:             L = L + 1
10: Return B = L.

It is interesting to see that unlike mr-bitmap, the sampling rates for S-bitmap
are not associated with the bucket locations, but only depend on the arrival of new
distinct items, through increases of $L$. In addition, we use the memory more
efficiently since we can adaptively change the sampling rates to fill in more
buckets, while mr-bitmap may leave some virtual bitmaps unused or some completely
filled, which leads to some waste of memory.

We further note that in the S-bitmap update process, only one hash is needed for
each incoming item. For the bucket update, only if the mapped bucket is empty are
the last $d$ bits of the hashed value used to determine whether the bucket should be
filled with 1 or not. Note that the sampling rate changes only when an empty bucket
is filled with 1. For example, if $K$ buckets become filled by the end of the
stream, the sampling rates only need to be updated $K$ times. Therefore, the
computational cost of S-bitmap is very low, and is similar to or lower than that of
benchmark algorithms such as mr-bitmap, LogLog and Hyper-LogLog (in fact,
Hyper-LogLog uses the same summary statistic as LogLog and thus their computational
costs are the same).
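
The update procedure of Algorithm 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the production implementation: SHA-256 stands in for the universal hash, buckets are indexed from 0, the function names are illustrative, and the sampling rates are passed in as a list `rates` (with `rates[k-1]` playing the role of $p_k$) because their actual choice is the subject of Section 5.

```python
import hashlib


def hash_bits(item: str, nbits: int) -> int:
    """Return the first `nbits` bits (nbits <= 256) of SHA-256(item).

    Assumption: SHA-256 is a stand-in for the universal hash of the text.
    """
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (256 - nbits)


def s_bitmap_update(stream, c: int, d: int, rates):
    """Run the S-bitmap update and return (B, V).

    `rates` must hold m = 2**c non-increasing sampling rates p_1, ..., p_m.
    """
    m = 1 << c
    if len(rates) != m:
        raise ValueError("need one sampling rate per bucket")
    bitmap = [0] * m
    filled = 0  # the counter L
    for item in stream:
        h = hash_bits(item, c + d)
        j = h >> d              # first c bits: bucket location
        u = h & ((1 << d) - 1)  # last d bits: value used for sampling
        # rates[filled] is p_{L+1}; the bucket test short-circuits once V is full.
        if bitmap[j] == 0 and u / (1 << d) < rates[filled]:
            bitmap[j] = 1
            filled += 1
    return filled, bitmap
```

With non-increasing rates, feeding the same items again leaves both `B` and `V` unchanged, which is the duplicate-filtering property argued above; with all rates equal to 1 the procedure reduces to the basic bitmap.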

4 Estimation 

In this section, we first derive a Markov chain model for the above L sequence and 
then obtain the S-bitmap estimator. 

4.1 A non-stationary Markov chain model 

From the S-bitmap update process, it is clear that the $n$ distinct items are
randomly mapped into the $m$ buckets, but not all corresponding buckets have values
1. From the above sufficiency argument, due to decreasing sampling rates, the bitmap
filters out replicate items automatically and its update only depends on the first
arrival of each distinct item, i.e. new item. Without loss of generality, let the
$n$ distinct items be hashed into locations $S_1, S_2, \ldots, S_n$ with
$1 \le S_i \le m$, indexed by the sequence of their first arrivals. Obviously, the
$S_i$ are i.i.d. Let $I_t$ be the indicator of whether or not the $t$-th distinct
item fills an empty bucket with 1. In other words, $I_t = 1$ if and only if the
$t$-th distinct item is hashed into an empty bucket (i.e. with value 0) and further
fills it with 1. Given the first $t - 1$ distinct items, let
$C(t-1) = \{S_j : I_j = 1, 1 \le j \le t - 1\}$ be the buckets that are filled with
1, and $L_{t-1} = |C(t-1)|$ be the number of buckets filled with 1. Then
$L_t = L_{t-1} + I_t$. Upon the arrival of the $t$-th distinct item that is hashed
to bucket location $S_t$, if $S_t$ does not belong to $C(t-1)$, i.e. the bucket is
empty, then by the design of S-bitmap, $I_t$ is independent of $S_t$. To be precise,
as defined in lines 3 and 5 of Algorithm 2, $j$ and $u$ associated with $x$ are
independent, one determining the location $S_t$ and the other determining the
sampling $I_t$. Obviously, according to line 7 of Algorithm 2, the conditional
probability that the $t$-th distinct item fills the $S_t$-th bucket with 1 is
$p_{L_{t-1}+1}$ if the bucket is empty and 0 otherwise, that is,

$$P(I_t = 1 \mid S_t \notin C(t-1), L_{t-1}) = p_{L_{t-1}+1}$$

and

$$P(I_t = 1 \mid S_t \in C(t-1), L_{t-1}) = 0.$$

The final output from the update algorithm is denoted by $B$, i.e.

$$B = L_n = \sum_{t=1}^{n} I_t,$$

where $n$ is the parameter to be estimated.

Since $S_t$ and $C(t-1)$ are independent, we have

$$P(I_t = 1 \mid L_{t-1})
  = P(I_t = 1 \mid S_t \notin C(t-1), L_{t-1}) \, P(S_t \notin C(t-1) \mid L_{t-1})
  = p_{L_{t-1}+1} \cdot \left(1 - \frac{L_{t-1}}{m}\right).$$

This leads to the Markov chain property of $L_t$ as summarized in the theorem below.

Theorem 1 Let $q_k = (1 - m^{-1}(k - 1)) p_k$ for $k = 1, \ldots, m$. If the
monotonicity condition holds, i.e. $p_1 \ge p_2 \ge \cdots$, then
$\{L_t : t = 1, \ldots, n\}$ follows a non-stationary Markov chain model:

$$L_t = L_{t-1} + 1, \quad \text{with probability } q_{L_{t-1}+1},$$
$$L_t = L_{t-1}, \quad \text{with probability } 1 - q_{L_{t-1}+1}.$$
4.2 Estimation 

Let Tfc be the index for the distinct item that fills an empty bucket with 1 such that 
there are k buckets filled with 1 by that time. That is, {T^. = t} is equivalent to 
{Lt-i = k — 1 and It = 1}. Now given the output B from the update algorithm, 
obviously Tb <n < Tb+i- A natural estimate of n is 

h = tB, (2) 

where 4 = ETb, 6 = 1,2, ■••. 

Let To = and to = for convenience. The following properties hold for Tj, 
and tb. 

Lemma 1 Under the monotonicity condition of $\{p_k\}$, $T_k - T_{k-1}$, for $1 \le k \le m$,
are distributed independently with geometric distributions, and for $1 \le t \le m$,

$$\mathbb{P}(T_k - T_{k-1} = t) = (1 - q_k)^{t-1} q_k.$$

The expectation and variance of $T_b$, $1 \le b \le m$, can be expressed as

$$t_b = \sum_{k=1}^{b} q_k^{-1}$$

and

$$\mathrm{var}(T_b) = \sum_{k=1}^{b} (1 - q_k)\, q_k^{-2}.$$

The proof of Lemma 1 follows from the standard Markov chain theory and is
provided in the appendix for completeness. Below we analyze how to choose the
sequential sampling rates $\{p_1, \cdots, p_m\}$ such that $Re(\hat{n})$ is stabilized for arbitrary
$n \in \{1, \cdots, N\}$.
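As a small numerical companion to Lemma 1, the sketch below evaluates $t_b$ and $\mathrm{var}(T_b)$ for a given sequence of sampling rates. The function name `moments` and the constant example rates are illustrative assumptions and do not come from the paper.

```python
from itertools import accumulate

def moments(p):
    """Return lists (t, var) with t[b-1] = E T_b and var[b-1] = var(T_b).

    p[k-1] is the sampling rate p_k; q_k = (1 - (k-1)/m) * p_k as in Theorem 1.
    """
    m = len(p)
    q = [(1 - (k - 1) / m) * p[k - 1] for k in range(1, m + 1)]
    t = list(accumulate(1 / qk for qk in q))                # partial sums of 1/q_k
    var = list(accumulate((1 - qk) / qk**2 for qk in q))    # partial sums of (1-q_k)/q_k^2
    return t, var

# Hypothetical example: constant rate p_k = 0.5 with m = 4 buckets.
t, var = moments([0.5] * 4)
print(t, var)
```

With constant rates the relative error $\sqrt{\mathrm{var}(T_b)}/t_b$ changes with $b$, which is the behaviour the dimensioning rule of the next section is designed to remove.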

5 Dimensioning rule and analysis 

In this section, we first describe the dimensioning rule for choosing the sampling
rates $\{p_k\}$. Notice that $T_b$ is an unbiased estimate of $t_b = \mathbb{E}T_b$ if $T_b$ is observable
but $t_b$ is unknown, where $t_1 < t_2 < \cdots < t_m$. Again formally denote
$Re(T_b) = \sqrt{\mathbb{E}(T_b t_b^{-1} - 1)^2}$ as the relative error. In order to make the RRMSE
of S-bitmap invariant to the unknown cardinality $n$, our idea is to choose the sampling
rates $\{p_k\}$ such that $Re(T_b)$ is invariant for $1 \le b \le m$, since $n$ must fall
in between some two consecutive $T_b$'s. We then prove that although the $T_b$ are unobservable,
choosing parameters that stabilize $Re(T_b)$ is sufficient for stabilizing the
RRMSE of S-bitmap for all $n \in \{1, \cdots, N\}$.

5.1 Dimensioning rule 

To stabilize $Re(T_b)$, we need some constant $C$ such that for $b = 1, \cdots, m$,

$$Re(T_b) = C^{-1/2}. \tag{3}$$

This leads to the dimensioning rule for S-bitmap as summarized by the following
theorem, where $C$ is determined later as a function of $N$ and $m$.

Theorem 2 Let $\{T_k - T_{k-1} : 1 \le k \le m\}$ follow independent geometric distributions
as in Lemma 1. Let $r = 1 - 2(C+1)^{-1}$. If

$$p_k = \frac{m}{m+1-k}\,(1 + C^{-1})\, r^{k},$$

then we have for $k = 1, \cdots, m$,

$$\frac{\sqrt{\mathrm{var}(T_k)}}{\mathbb{E}T_k} = C^{-1/2}.$$

That is, the relative errors $Re(T_b)$ do not depend on $b$.

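Proof Note that (3) is equivalent to

$$\frac{\mathrm{var}(T_{b+1})}{t_{b+1}^{2}} = \frac{\mathrm{var}(T_b)}{t_b^{2}}.$$

By Lemma 1, this is equivalent to

$$\frac{\mathrm{var}(T_b) + (1 - q_{b+1})\, q_{b+1}^{-2}}{(t_b + q_{b+1}^{-1})^{2}} = \frac{\mathrm{var}(T_b)}{t_b^{2}}.$$

Since $\mathrm{var}(T_b) = C^{-1} t_b^{2}$, then

$$q_{b+1}^{-1} = \frac{C + 2 t_b}{C - 1}.$$

Since $t_{b+1} = t_b + q_{b+1}^{-1}$, we have

$$t_{b+1} = \frac{C+1}{C-1}\, t_b + \frac{C}{C-1}.$$

By induction,

$$t_{b+1} = \left(\frac{C+1}{C-1}\right)^{b} \left(t_1 + \frac{C}{2}\right) - \frac{C}{2}.$$

Since $\mathrm{var}(T_1) = (1 - q_1)\, q_1^{-2} = C^{-1} t_1^{2}$ and $t_1 = q_1^{-1}$, we have $t_1 = C(C-1)^{-1}$.
Hence with some calculus, we have, for $r = 1 - 2(C+1)^{-1}$,

$$q_b = (1 + C^{-1})\, r^{b}.$$

Since $q_b = (1 - \frac{b-1}{m})\, p_b$, the sequential sampling rate $p_b$, for $b = 1, \cdots, m$, can be
expressed as

$$p_b = \frac{m}{m+1-b}\,(1 + C^{-1})\, r^{b}.$$

The conclusion follows as the steps can be reversed.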
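The claim of Theorem 2 can be checked numerically. The sketch below builds $q_b$ and $p_b$ from the dimensioning rule and evaluates $\sqrt{\mathrm{var}(T_b)}/\mathbb{E}T_b$ with the moment formulas of Lemma 1. The helper names `dimension` and `relative_errors` are assumptions made for illustration, and the values of `m` and `C` are arbitrary.

```python
import math

def dimension(m, C):
    """Sampling rates p_b of Theorem 2 and the implied q_b, for b = 1..m (0-indexed lists)."""
    r = 1 - 2 / (C + 1)
    q = [(1 + 1 / C) * r**b for b in range(1, m + 1)]
    p = [m / (m + 1 - b) * q[b - 1] for b in range(1, m + 1)]
    return p, q

def relative_errors(q):
    """sqrt(var(T_b)) / E T_b for each b, using the sums in Lemma 1."""
    t = var = 0.0
    out = []
    for qb in q:
        t += 1 / qb
        var += (1 - qb) / qb**2
        out.append(math.sqrt(var) / t)
    return out

p, q = dimension(m=200, C=50.0)          # arbitrary illustrative values
errs = relative_errors(q)
print(min(errs), max(errs), 50.0 ** -0.5)
```

The identity is checked here on the $q_b$ alone; for $b$ close to $m$ the corresponding $p_b$ may exceed 1, so it is not a valid sampling rate there.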

It is easy to check that the monotonicity property holds strictly for $\{p_k : 1 \le
k \le m - 2^{-1}C\}$, thus satisfying the condition of Lemma 1. For $k > m - 2^{-1}C$,
the monotonicity does not hold. So it is natural to expect that the upper bound $N$
is achieved when $m - 2^{-1}C$ buckets (suppose $C$ is even) in the bitmap turn into 1,
i.e. $t_{m - 2^{-1}C} = N$, or,

$$N = \frac{C}{2}\left(r^{-(m - 2^{-1}C)} - 1\right). \tag{6}$$

Since $r = 1 - 2(C+1)^{-1}$, we obtain

$$m = \frac{C}{2} + \frac{\ln(1 + 2NC^{-1})}{\ln(1 + 2(C-1)^{-1})}. \tag{7}$$

Now, given the maximum possible cardinality $N$ and bitmap size $m$, $C$ can be
solved uniquely from this equation.

For example, if $N = 10^{6}$ and $m = 30{,}000$ bits, then from (7) we can solve
$C \approx 0.01^{-2}$. That is, if the sampling rates $\{p_k\}$ in Theorem 2 are designed using
such $(m, N)$, then $Re(\hat{n})$ can be expected to be approximately 1% for all $n \in
\{1, \cdots, 10^{6}\}$. In other words, to achieve errors no more than 1% for all possible
cardinalities from 1 to $N$, we need only about 30 kilobits of memory for S-bitmap.
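Equation (7) has no closed-form inverse in $C$, so in practice $C$ has to be found numerically. The following is a minimal sketch using bisection; the function names, the bracketing interval and the tolerance are assumptions made here, and the bisection relies on the right-hand side of (7) increasing in $C$.

```python
import math

def required_bits(C, N):
    """Right-hand side of Equation (7): bitmap size m implied by C and N."""
    return C / 2 + math.log(1 + 2 * N / C) / math.log(1 + 2 / (C - 1))

def solve_C(m, N, tol=1e-9):
    """Solve required_bits(C, N) = m for C by bisection on (1, 2m)."""
    lo, hi = 1.0 + 1e-9, 2.0 * m        # required_bits(lo) < m < required_bits(hi)
    while hi - lo > tol * hi:
        mid = (lo + hi) / 2
        if required_bits(mid, N) < m:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

C = solve_C(m=4000, N=2**20)
eps = (C - 1) ** -0.5                   # RRMSE implied by Theorem 3
print(round(C, 1), round(100 * eps, 1))
```

The setting $m = 4{,}000$, $N = 2^{20}$ is the first simulation configuration of Section 6.1, for which the paper reports $C = 915.6$ and $\epsilon = 3.3\%$.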

Since $\ln(1 + x) \approx x(1 - \frac{1}{2}x)$ for $x$ close to 0, (7) also implies that to achieve a
small RRMSE $\epsilon$, which is equal to $(C-1)^{-1/2}$ according to Theorem 3 below, the
memory requirement can be approximated as follows:

$$m \approx \frac{1}{2}\,\epsilon^{-2}\left(1 + \ln(1 + 2N\epsilon^{2})\right).$$

Therefore, asymptotically, the memory efficiency of S-bitmap is much better than
log-counting algorithms, which require memory in the order of $\epsilon^{-2}\log N$. Furthermore,
assuming $N\epsilon^{2} > 1$, if $\epsilon < \sqrt{(\log N)^{\eta} / (2eN)}$ where $\eta \approx 3.1206$,
S-bitmap is better than Hyper-LogLog counting, which requires memory approximately
$1.04^{2}\epsilon^{-2}\log(\log N)$ (see Flajolet et al. 2007) in order to achieve an asymptotic
RRMSE $\epsilon$; otherwise it is worse than Hyper-LogLog.

Remark. In implementation, we set $p_b = p_{m - 2^{-1}C}$ for $m - 2^{-1}C \le b \le m$ so
that the sampling rates satisfy the monotone property which is necessary by Lemma
1. Since the focus is on cardinalities in the range from 1 to $N$ as pre-specified, which
corresponds to $B \le m - 2^{-1}C$ as discussed above, we simply truncate the
output $L_n$ at $m - 2^{-1}C$ if it is larger than this value, which becomes possible when
$n$ is close to $N$, that is,

$$B = \min(L_n,\, m - 2^{-1}C). \tag{8}$$
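Putting the dimensioning rule, the frozen tail rates of the Remark and the truncation (8) together, an S-bitmap can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: Algorithm 2 is not reproduced in this section, so the SHA-256 based split of a hash into a bucket index `j` and a uniform value `u`, the class name `SBitmap` and the one-byte-per-bucket storage are all choices made here. The estimate uses the closed form $t_b = \frac{C}{2}(r^{-b} - 1)$, which follows from summing $q_k^{-1}$ and agrees with Equation (6).

```python
import hashlib

class SBitmap:
    """Illustrative S-bitmap; the hashing scheme and names are assumptions."""

    def __init__(self, m, C):
        self.m, self.C = m, C
        self.r = 1 - 2 / (C + 1)
        self.kmax = int(m - C / 2)        # truncation point m - C/2 of Equation (8)
        self.bits = bytearray(m)          # one byte per bucket, for clarity only
        self.L = 0                        # number of buckets currently set to 1

    def rate(self, k):
        """Sampling rate p_k, frozen at p_kmax for k > kmax as in the Remark."""
        k = min(k, self.kmax)
        return self.m / (self.m + 1 - k) * (1 + 1 / self.C) * self.r**k

    def add(self, item):
        h = hashlib.sha256(str(item).encode()).digest()
        j = int.from_bytes(h[:8], "big") % self.m       # bucket location
        u = int.from_bytes(h[8:16], "big") / 2**64      # uniform value in [0, 1)
        if self.bits[j] == 0 and u < self.rate(self.L + 1):
            self.bits[j] = 1
            self.L += 1

    def estimate(self):
        B = min(self.L, self.kmax)                      # truncated output, Equation (8)
        return self.C / 2 * (self.r ** (-B) - 1)        # t_B

sb = SBitmap(m=4000, C=915.6)
for i in range(5000):
    sb.add(f"flow-{i}")
    sb.add(f"flow-{i}")      # a repeated item hashes to the same (j, u)
print(sb.estimate())
```

Because the rates are non-increasing, an item that was rejected once (its `u` was too large, or its bucket was already full) is rejected on every later occurrence, so duplicates leave the bitmap unchanged.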

5.2 Analysis 

Here we prove that the S-bitmap estimate is unbiased and its relative estimation
error is indeed "scale-invariant" as we had expected, if we ignore the truncation
effect in (8) for simplicity.

Theorem 3 Let $B = L_n$, where $L_n$ is the number of 1-bits in the S-bitmap, as
defined in Theorem 1 for $1 \le n \le N$. Under the dimensioning rule of Theorem 2,
for the S-bitmap estimator $\hat{n} = t_B$ as defined in (2), we have

$$\mathbb{E}\hat{n} = n,$$

$$RRMSE(\hat{n}) = (C - 1)^{-1/2}.$$

Proof Let, for $a > 1$,

$$Y_n = \prod_{j=0}^{L_n} \left(1 + (a-1)\, q_j^{-1}\right).$$

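By Theorem 1, $L_{n+1} = i + \mathrm{Bernoulli}(q_{i+1})$ if $L_n = i$. Thus

$$\begin{aligned}
\mathbb{E}(Y_{n+1} \mid L_n)
&= \mathbb{E}\left(Y_n I(L_{n+1} = i) + Y_n (1 + (a-1)\, q_{i+1}^{-1})\, I(L_{n+1} = i+1) \mid L_n\right) \\
&= Y_n \{1 - q_{i+1} + q_{i+1}(1 + (a-1)\, q_{i+1}^{-1})\} \\
&= Y_n a
\end{aligned}$$

if $L_n = i$. Therefore $\{a^{-n} Y_n : n = 0, 1, \cdots\}$ is a martingale.

Note that $q_i = (1 + C^{-1})\, r^{i}$, $i \ge 0$, where $r = 1 - 2(C+1)^{-1}$. Since $L_0 = 0$,
$\mathbb{E}Y_0 = 1 + (a-1)\, q_0^{-1}$ and since $a^{-n}\mathbb{E}Y_n = \mathbb{E}Y_0$, we have

$$\mathbb{E}Y_n = a^{n}(1 + (a-1)\, q_0^{-1}),$$

that is,

$$a^{n}(1 + (a-1)\, q_0^{-1}) = \mathbb{E}\prod_{j=0}^{B} (1 + (a-1)\, q_j^{-1}).$$

Recall that $t_b = \sum_{j=1}^{b} q_j^{-1}$ and $\sum_{j=1}^{b} q_j^{-2}(1 - q_j) = C^{-1}\left(\sum_{j=1}^{b} q_j^{-1}\right)^{2}$. Taking the first
derivative at $a = 1+$, we have (since $B = L_n$)

$$n + q_0^{-1} = \mathbb{E}\sum_{j=0}^{B} q_j^{-1} = \mathbb{E}t_B + q_0^{-1},$$

and taking the second derivative at $a = 1+$, we have

$$\begin{aligned}
n(n-1) + 2n q_0^{-1}
&= \mathbb{E}\Big(\sum_{j=0}^{B} q_j^{-1}\Big)^{2} - \mathbb{E}\sum_{j=0}^{B} q_j^{-2} \\
&= \mathbb{E}(t_B + q_0^{-1})^{2} - \mathbb{E}(q_0^{-2} + t_B + C^{-1} t_B^{2}).
\end{aligned}$$

Therefore, $\mathbb{E}t_B = n$ and $\mathbb{E}t_B^{2} = n^{2} C/(C-1)$. Thus

$$\frac{\mathrm{var}(t_B)}{n^{2}} = \frac{1}{C-1}.$$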
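Theorem 3 can also be checked by simulation without building a bitmap: by Lemma 1 the waiting times $T_k - T_{k-1}$ are independent geometric variables, so $B$ is the number of waiting times that fit into $n$ arrivals. The sketch below follows that idea and, like the theorem, ignores the truncation (8). Function names, the seed and the replicate count are assumptions chosen for illustration.

```python
import math
import random

def simulate_estimate(n, C, rng):
    """Draw one estimate t_B for n distinct items via the chain of Theorem 1."""
    r = 1 - 2 / (C + 1)
    arrived, B = 0, 0
    while True:
        q = (1 + 1 / C) * r ** (B + 1)                     # q_{B+1}
        # Geometric waiting time T_{B+1} - T_B on {1, 2, ...} by inversion (Lemma 1).
        gap = int(math.log(1.0 - rng.random()) / math.log(1 - q)) + 1
        if arrived + gap > n:
            break                                          # T_B <= n < T_{B+1}
        arrived += gap
        B += 1
    return C / 2 * (r ** (-B) - 1)                         # t_B in closed form

def empirical(n, C, reps=1000, seed=1):
    """Empirical mean and RRMSE of the estimate over `reps` replicates."""
    rng = random.Random(seed)
    est = [simulate_estimate(n, C, rng) for _ in range(reps)]
    mean = sum(est) / reps
    rrmse = math.sqrt(sum((e / n - 1) ** 2 for e in est) / reps)
    return mean, rrmse

mean, rrmse = empirical(n=2000, C=373.7)
print(mean, rrmse, (373.7 - 1) ** -0.5)
```

Up to Monte Carlo noise, the empirical mean should be close to $n$ and the empirical RRMSE close to $(C-1)^{-1/2}$, regardless of the value of $n$ chosen.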
Remark. This elegant martingale argument already appeared in Rosenkrantz (1987)
but under a different and simpler setting, and we rediscovered it.

In implementation, we use the truncated version of $B$, i.e. (8), which is equivalent
to truncating the theoretical estimate by $N$ if it is greater than $N$. Since by
assumption the true cardinalities are no more than $N$, this truncation removes one-sided
bias and thus reduces the theoretical RRMSE as shown in the above theorem.
Our simulation below shows that this truncation effect is practically negligible.

6 Simulation studies and comparison 

In this section, we first present empirical studies that justify the theoretical analysis 
of S-bitmap. Then we compare S-bitmap with state-of-the-art algorithms in the 
literature in terms of memory efficiency and the scale invariance property. 

6.1 Simulation validation of S-bitmap's theoretical performance

In the above, our theoretical analysis shows that without truncation by $N$, the S-bitmap
has a scale-invariant relative error $\epsilon = (C-1)^{-1/2}$ for $n$ in a wide range
$[1, N]$, where $C$ satisfies Equation (7) given bitmap size $m$. We study the S-bitmap
estimates based on (8) with two sets of simulations, both with $N = 2^{20}$ (about one
million), and then compare empirical errors with the theoretical results. In the first
set, we fix $m = 4{,}000$, which gives $C = 915.6$ and $\epsilon = 3.3\%$, and in the second
set, we fix $m = 1{,}800$, which gives $C = 373.7$ and $\epsilon = 5.2\%$. We design the
sequential sampling rates according to Section 5.1. For $1 \le n \le N$, we simulate $n$
distinct items and obtain the S-bitmap estimate. For each $n$ (power of 2), we replicate
the simulation 1000 times and obtain the empirical RRMSE. These empirical errors
are compared with the theoretical errors in Figure 2. The results show that for
both sets, the empirical errors and theoretical errors match extremely well and the
truncation effect is hardly visible.

Figure 2: Empirical and theoretical estimation errors of S-bitmap with $m = 4{,}000$
bits and $m = 1{,}800$ bits of memory for estimating cardinalities $1 \le n \le 2^{20}$.
(Legend: simulated and theoretical error for $m = 4000$ (3.3%) and $m = 1800$ (5.2%);
horizontal axis: cardinality, log base 2.)

6.2 Comparison with state-of-the-art algorithms 

In this subsection, we demonstrate that S-bitmap is more efficient in terms of mem- 
ory and accuracy, and more reliable than state-of-the-art algorithms such as mr- 
bitmap, LogLog and Hyper-LogLog for many practical settings. 

Memory efficiency Hereafter, the memory cost of a distinct counting algorithm
stands for the size of the summary statistics (in bits) and does not account for hash
functions (whose seeds require some small memory space), and we note that the
algorithms to be compared here all require at least one universal hash function.



            ε = 1%               ε = 3%               ε = 9%
N        HLLog    S-bitmap    HLLog    S-bitmap    HLLog    S-bitmap
10^3     432.6    59.1        48.1     11.3        5.3      2.4
10^4     432.6    104.9
(Entries surviving from later rows, position not recoverable: 5.2, 540.8)

Table 2: Memory cost (with unit 100 bits) of Hyper-LogLog and S-bitmap with
given $N$, $\epsilon$.




Figure 3: Contour plot of the ratios of the memory cost of Hyper-LogLog to that
of S-bitmap with the same $(N, \epsilon)$: the contour line with small circles and label '1'
represents the contour with ratio values equal to 1. (Horizontal axis: $\epsilon$ in percent.)

From (7), the memory cost for S-bitmap is approximately linear in $\log(2N/C)$.
By the theory developed in Durand and Flajolet (2003) and Flajolet et al. (2007),
the space requirements for LogLog counting and Hyper-LogLog are approximately
$1.30^{2} \times \alpha\epsilon^{-2}$ and $1.04^{2} \times \alpha\epsilon^{-2}$ in order to achieve RRMSE $\epsilon = (C-1)^{-1/2}$, where

$$\alpha = \begin{cases} 5, & \text{if } 2^{16} \le N < 2^{32}, \\ 4, & \text{if } 2^{8} \le N < 2^{16}. \end{cases}$$

Here $\alpha = k + 1$ if $2^{2^{k}} \le N < 2^{2^{k+1}}$ for any positive integer $k$. So LogLog requires
about 56% more memory than Hyper-LogLog to achieve the same asymptotic error.
There is no analytic study of the memory cost for mr-bitmap in the literature, thus
below we report a thorough memory cost comparison only between S-bitmap and
Hyper-LogLog.
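The two memory formulas can be evaluated side by side. The sketch below computes the S-bitmap cost from Equation (7) with $C = 1 + \epsilon^{-2}$ (from $\epsilon = (C-1)^{-1/2}$) and the Hyper-LogLog cost from $1.04^{2}\alpha\epsilon^{-2}$; the function names are assumptions, and the results are printed in units of 100 bits, the unit used in Table 2.

```python
import math

def sbitmap_bits(N, eps):
    """S-bitmap memory from Equation (7), with C = 1 + eps**-2."""
    C = 1 + eps ** -2
    return C / 2 + math.log(1 + 2 * N / C) / math.log(1 + 2 / (C - 1))

def hyperloglog_bits(N, eps):
    """Approximate Hyper-LogLog memory 1.04^2 * alpha * eps^-2 quoted in the text."""
    k = math.floor(math.log2(math.log2(N)))   # 2^(2^k) <= N < 2^(2^(k+1))
    alpha = k + 1
    return 1.04 ** 2 * alpha / eps ** 2

for N in (10**3, 10**4):
    for eps in (0.01, 0.03, 0.09):
        print(N, eps,
              round(hyperloglog_bits(N, eps) / 100, 1),
              round(sbitmap_bits(N, eps) / 100, 1))
```

For $N = 10^{3}$ this reproduces the first row of Table 2 up to rounding, which gives some confidence that the formulas are being read as the authors intended.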

Given $N$ and $\epsilon$, the theoretical memory costs for S-bitmap and Hyper-LogLog
can be calculated as above. Figure 3 shows the contour plot of the ratios of the
memory requirement of Hyper-LogLog to that of S-bitmap, where the ratios are
shown as the labels of corresponding contour lines. Here $\epsilon \times 100\%$ is shown on
the horizontal axis and $N$ is shown on the vertical axis, both in the scale of log
base 2. The contour line with small circles and label '1' shows the boundary where
Hyper-LogLog and S-bitmap require the same memory cost $m$. The lower left side
of this contour line is the region where Hyper-LogLog requires more memory than
S-bitmap, and the upper right side shows the opposite. Table 2 lists the detailed
memory cost for both S-bitmap and Hyper-LogLog in a few cases where $\epsilon$ takes
values 1%, 3% and 9%, and $N$ takes values from 1000 to 10^. For example, for
$N = 10^{6}$ and $\epsilon \le 3\%$, which is a suitable setup for core network flow monitoring,
Hyper-LogLog requires at least 27% more memory than S-bitmap. As another
example, for $N = 10^{4}$ and $\epsilon \le 3\%$, which is a reasonable setup for household
network monitoring, Hyper-LogLog requires at least 120% more memory than S-bitmap.
In summary, S-bitmap is uniformly more memory-efficient than Hyper-LogLog
when $N$ is medium or small and $\epsilon$ is small, though the advantage of S-bitmap
against Hyper-LogLog dissipates with $N$ > 10^ and large $\epsilon$.

Scale-invariance property In many applications, the cardinalities of interest are in
the scale of a million or less. Therefore we report simulation studies with $N = 2^{20}$.
In the first experiment, $m = 40{,}000$ bits of memory is used for all four algorithms.
The design of mr-bitmap is optimized according to Estan et al. (2006). Let the true
cardinality $n$ vary from 10 to $10^{6}$ and the algorithms are run to obtain corresponding
estimates $\hat{n}$ and estimation errors $n^{-1}\hat{n} - 1$. Empirical RRMSE is computed
based on 1000 replicates of this procedure. In the second and third experiments,
the setting is similar except that $m = 3{,}200$ and $m = 800$ are used, respectively.
The performance comparison is reported in Figure 4. The results show that in the
first experiment, mr-bitmap has smaller errors than LogLog and Hyper-LogLog, but
S-bitmap has smaller errors than all competitors for cardinalities greater than 40,000;
in the second experiment, Hyper-LogLog performs better than mr-bitmap, but
S-bitmap performs better than all competitors for cardinalities greater than 1,000;
and in the third experiment, with higher errors, S-bitmap still performs slightly
better than Hyper-LogLog for cardinalities greater than 1,000, and both are better
than mr-bitmap and LogLog. Obviously, the scale invariance property is validated
for S-bitmap consistently, while it is not the case for the competitors. We note that
mr-bitmap performs badly at the boundary; those points are not plotted in the figures as
they are out of range.
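The error measures used in this section are all functions of the relative errors $n^{-1}\hat{n} - 1$ over the replicates. A minimal sketch of how the empirical RRMSE (the $L_2$ metric), the mean absolute relative error (the $L_1$ metric) and a quantile of the absolute relative error could be computed is given below. The function name and the quantile convention are assumptions, since the paper does not state which empirical quantile definition was used.

```python
import math

def error_metrics(estimates, n, quantile=0.99):
    """Empirical L1, L2 (RRMSE) and quantile of |n_hat / n - 1| over replicates."""
    abs_err = sorted(abs(e / n - 1) for e in estimates)
    k = len(abs_err)
    l1 = sum(abs_err) / k
    l2 = math.sqrt(sum(x * x for x in abs_err) / k)
    # Assumed convention: smallest order statistic with at least `quantile` of the mass.
    q = abs_err[min(k - 1, math.ceil(quantile * k) - 1)]
    return l1, l2, q

# Hypothetical replicates for a true cardinality of 100.
print(error_metrics([90, 110, 100, 100], n=100))
```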

Other performance measures Besides RRMSE, which is the $L_2$ metric, we have
also evaluated the performance based on other metrics such as $\mathbb{E}|n^{-1}\hat{n} - 1|$, namely
the $L_1$ metric, and the quantile of $|n^{-1}\hat{n} - 1|$. As examples, Table 3 and Table
4 report the comparison of three error metrics ($L_1$, $L_2$ and 99% quantile) for the
cases with ($N = 10^{4}$, $m = 2700$) and ($N = 10^{6}$, $m = 6720$), which represent
two settings of different scales. In both settings, mr-bitmap works very well for
small cardinalities and worse as cardinalities get large, with a strong boundary effect.
Hyper-LogLog has a similar behavior, but is much more reliable. Interestingly,
empirical results suggest that the scale-invariance property holds for S-bitmap not
only with RRMSE, but also approximately with the metrics of $L_1$ and the 99% quantile.
For large cardinalities relative to $N$, the errors of Hyper-LogLog are all higher than
those of S-bitmap in both settings.

Figure 4: Comparison among mr-bitmap, LogLog, Hyper-LogLog and S-bitmap for
estimating cardinalities from 10 to $10^{6}$ with $m = 40{,}000$, $m = 3{,}200$ and $m = 800$
respectively.

Surviving table rows (cardinality $n$ followed by the nine reported values; the column
header, apart from the label $L_2$, did not survive extraction):

n = 10:     1.3   0.6   0.8   2.6   1.6   3     10    10    10
n = 100:    2.1   1.4   2.5   2.6   1.7   3.2   6     4     8
n = 1000:   2.1   1.6   3.5   2.6   2     4.4   6.7   5     11.4
Remaining fragments: 11.3; 2.6; 4.3, 6.9, 119, 11.2; 2.1; 3.5, 2.6, 102.4, 4.4, 6.6, 131.1, 11.5

Table 3: Comparison of $L_1$, $L_2$ metrics and 99%-quantiles (times 100) among mr-bitmap
(mr), Hyper-LogLog (H) and S-bitmap (S) for $N = 10^{4}$ and $m = 2700$.

Table 4: Comparison of $L_1$, $L_2$ metrics and 99%-quantiles (times 100) among mr-bitmap
(mr), Hyper-LogLog (H) and S-bitmap (S) for $N = 10^{6}$ and $m = 6720$.
Figure 5: Time series of true flow counts (in triangle) and S-bitmap estimates (in
dotted line) per minute on both links during slammer outbreak: link 1 (a) and link
0 (b). (Horizontal axis: time, PST.)

7 Experimental evaluation 

We now evaluate the S-bitmap algorithm on real network data and also compare
it with the three competitors above.

7.1 Worm traffic monitoring 

We first evaluate the algorithms on worm traffic data, using two 9-hour traffic
traces (www.rbeverly.net/research/slammer). The traces were collected by MIT
Laboratory for Computer Science from a peering exchange point (two independent
links, namely link 0 and link 1) on Jan 25th 2003, during the period of the "Slammer"
worm outbreak. We report the results of estimating flow counts for each link. We
take $N = 10^{6}$, which is sufficient for most university traffic in normal scenarios.
Since in practice routers may not allocate much resource for flow counting, we use
$m = 8000$ bits. According to (7), we obtain $C = 2026.55$ for designing the sampling
rates for S-bitmap, which corresponds to an expected standard deviation of
$\epsilon = 2.2\%$ for S-bitmap. The same memory is used for other algorithms. The two
panels of Figure 5 show the time series of flow counts in every one-minute interval in
triangles on link 1 and link 0 respectively, and the corresponding S-bitmap estimates
in dashed lines. Occasionally the flows become very bursty (an order of magnitude of
difference), probably due to a few heavy worm scanners, while most of the time the time
series are quite stable. The estimation errors of the S-bitmap estimates are almost invisible
despite the non-stationary and bursty points.

The performance comparison between S-bitmap and alternative methods is reported
in Figure 6 (left for Link 1 and right for Link 0), where the y-axis is the proportion
of estimates that have absolute relative estimation errors more than a given
threshold on the x-axis. The three thin vertical lines show 2, 3 and 4 times the expected
standard deviation for S-bitmap, respectively. For example, the proportion
of S-bitmap estimates whose absolute relative errors are more than 3 times the expected
standard deviation is almost 0 on both links, while for the competitors, the
proportions are at least 1.5% given the same threshold. The results show that S-bitmap
is most resistant to large errors among all four algorithms for both Link 1
and Link 0.

Figure 6: Proportions of estimates (y-axis) that have RRMSE more than a threshold
(x-axis) based on S-bitmap, mr-bitmap, LogLog and Hyper-LogLog, respectively
on the two links during slammer outbreak: link 1 (a) and link 0 (b), where the
three vertical lines show 2, 3 and 4 times expected standard deviation for S-bitmap
separately.

Figure 7: Histogram of five-minute flow counts on backbone links (log base 2).
(Horizontal axis: number of flows, log base 2.)



7.2 Flow traffic on backbone network links 

Now we apply the algorithms for counting network link flows in a core network.
The real data was obtained from a Tier-1 US service provider for 600 backbone
links in the core network, which includes time series of traffic volume in flow counts
on MPLS (Multi Protocol Label Switching) paths in every five minutes. The traffic
scales vary dramatically from link to link as well as from time to time. Since the
original traces are not available, we use simulated data for each link to compute
S-bitmap and then obtain estimates. We set $N = 1.5 \times 10^{6}$ and use $m = 7{,}200$ bits
of memory to configure all algorithms as above, which corresponds to an expected
standard deviation of 2.4% for S-bitmap. The simulation uses a snapshot of
five-minute interval flow counts, whose histogram in log base 2 is presented in Figure
7. The vertical lines show that the 0.1%, 25%, 50%, 75%, and 99% quantiles are
18, 196, 2817, 19401 and 361485 respectively, where about 10% of the links with
no flows or flow counts less than 10 are not considered. The performance comparison
between S-bitmap and alternative methods is reported in Figure 8, similar to
Figure 6. The results show that both S-bitmap and Hyper-LogLog give very accurate
estimates with relative estimation errors bounded by 8%, while mr-bitmap has
worse performance and LogLog is the worst (off the range). Overall, S-bitmap is
most resistant to large errors among all four algorithms. For example, the absolute
relative errors based on S-bitmap are within 3 times the standard deviation for all
links, while there is one link whose absolute relative error is beyond this threshold
for Hyper-LogLog, and two such links for mr-bitmap.

Figure 8: Proportions of estimates (y-axis) that have RRMSE more than a threshold
(x-axis) based on S-bitmap, mr-bitmap, LogLog and Hyper-LogLog, respectively,
where the three vertical lines show 2, 3 and 4 times expected standard deviation
for S-bitmap, separately.
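The quantity plotted in Figures 6 and 8, the proportion of estimates whose absolute relative error exceeds a threshold, is simple to compute once per-link truths and estimates are available. The sketch below is illustrative only: the function name is an assumption, the truth values reuse the flow-count quantiles quoted above, and the estimates are invented for the example rather than taken from the experiment.

```python
def exceedance_proportion(estimates, truths, threshold):
    """Share of estimates whose absolute relative error exceeds `threshold`."""
    errs = [abs(e / t - 1) for e, t in zip(estimates, truths)]
    return sum(err > threshold for err in errs) / len(errs)

sigma = 0.024                                   # expected standard deviation, Section 7.2
truths = [18, 196, 2817, 19401, 361485]         # flow-count quantiles quoted in the text
estimates = [18, 200, 2700, 19000, 400000]      # hypothetical estimates, not real results
for k in (2, 3, 4):
    print(k, exceedance_proportion(estimates, truths, k * sigma))
```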

8 Conclusion 

Distinct counting is a fundamental problem in the database literature and has found
important applications in many areas, especially in modern computer networks. In
this paper, we have proposed a novel statistical solution (S-bitmap), which is scale-invariant
in the sense that its relative root mean square error is independent of the
unknown cardinalities in a wide range. To achieve the same accuracy, with similar
computational cost, S-bitmap consumes significantly less memory than state-of-the-art
methods such as multiresolution bitmap, LogLog counting and Hyper-LogLog
for cardinality scales common in practice.

Appendix

8.1 Proof of Lemma 1

By the definition of $\{T_k : 1 \le k \le m\}$, we have

$$\begin{aligned}
\mathbb{P}(T_k - T_{k-1} = t)
&= \sum_{s=k-1}^{\infty} \mathbb{P}(T_{k-1} = s,\, T_k = t + s) \\
&= \sum_{s=k-1}^{\infty} \mathbb{P}(I_s = 1,\, I_{t+s} = 1,\, L_s = k-1,\, L_{t+s} = k).
\end{aligned}$$

Since $L_s \le L_{s+1} \le \cdots \le L_{s+t}$, by the Markov chain property of $\{L_t : t =
1, \cdots\}$, we have for $k \ge 1$ and $s \ge k-1$,

$$\begin{aligned}
&\mathbb{P}(I_s = 1,\, I_{t+s} = 1,\, L_s = k-1,\, L_{t+s} = k) \\
&\quad= \mathbb{P}(L_s = k-1,\, I_s = 1)\, \mathbb{P}(L_{t+s} = k \mid L_{t+s-1} = k-1)
\times \prod_{j=s+1}^{s+t-1} \mathbb{P}(L_j = k-1 \mid L_{j-1} = k-1) \\
&\quad= \mathbb{P}(T_{k-1} = s)\, q_k \prod_{j=s+1}^{s+t-1} (1 - q_k) \\
&\quad= \mathbb{P}(T_{k-1} = s)\, q_k (1 - q_k)^{t-1}.
\end{aligned}$$

Notice that $\sum_{s=k-1}^{\infty} \mathbb{P}(T_{k-1} = s) = \mathbb{P}(T_{k-1} \ge k-1)$ is the probability that the
$(k-1)$-th filled bucket happens when or after the $(k-1)$-th distinct item arrives, which is
100% since each distinct item can fill at most one empty bucket. Therefore

$$\mathbb{P}(T_k - T_{k-1} = t) = q_k (1 - q_k)^{t-1}.$$

That is, $T_k - T_{k-1}$ follows a geometric distribution. The independence of $\{T_k -
T_{k-1} : 1 \le k \le m\}$ can be proved similarly using the Markov property of $\{L_t : t =
1, 2, \cdots\}$, for which we refer to Chapter 3 of Durrett (1996). This completes the proof
of Lemma 1.